Suppress spurious "Failed to detach context" log in OTel interceptor#1596
Suppress spurious "Failed to detach context" log in OTel interceptor#1596brianstrauch wants to merge 4 commits into
Conversation
The tracing workflow interceptor attaches an OTel context at the start of execute_workflow/handle_query and detaches it in the finally. The workflow event loop runs portions of the workflow inside contextvars.copy_context().run(...), so the finally can execute in a copied contextvars.Context. A copy preserves the OTel context value (so the existing `context is get_current()` guard passes) but invalidates the attach token, so opentelemetry.context.detach logs "Failed to detach context". Because LogCapturer attaches to the process-global opentelemetry.context logger, that stray log also bled into test_opentelemetry_safe_detach when a tracing workflow tore down during its capture window, making it flaky. Route the best-effort detach through _safe_detach, which drops only that one log record via a scoped logging filter. It still calls opentelemetry.context.detach so attach/detach calls stay balanced, satisfying the leak invariant enforced by test_opentelemetry_context_restored_after_activity.
| # token. | ||
| if context is opentelemetry.context.get_current(): | ||
| opentelemetry.context.detach(token) | ||
| _safe_detach(token) |
There was a problem hiding this comment.
I don't think this is really a good change. Doing a bunch of extra custom actions to remove the log is pretty wacky, and the context actually cannot be different from the one it was created on, see the line directly above which was added for this purpose.
There was a problem hiding this comment.
the context actually cannot be different from the one it was created on
The term "context" is overloaded here: The OTel context is copied by value and doesn't change, but the runtime context does, which results in the log message. I agree this is not the greatest fix, so let me try pushing a fix for the root issue. I couldn't get it working the first time I tried, but the idea is that we could restore context by reattaching the previous context instead of detaching the attach token.
… attach The tracing interceptor attached an OTel context and detached it via opentelemetry.context.detach(token) in a finally. detach calls ContextVar.reset(token), which raises (and OTel logs "Failed to detach context") when the finally runs in a different contextvars.Context than the attach -- e.g. when the workflow event loop resumes inside contextvars.copy_context().run(...). A copy preserves the OTel context value (so the `context is get_current()` guard passes) but invalidates the token. Restore the previously-current context by re-attaching it instead of detaching the token. attach goes through ContextVar.set, which is valid in any contextvars.Context, so the failing reset is never reached and there is no spurious log to suppress. This replaces the _safe_detach logging-filter workaround entirely. Rework the leak test to model OTel's current-context ContextVar and assert it is restored to baseline, rather than counting attach/detach calls (which no longer balance once restoration goes through attach). Remove test_opentelemetry_safe_detach, which only exercised the deleted suppression.
context.attach's parameter is named `context`; the leak test's tracked_attach wrapper named it `ctx`, which mypy rejects when assigning the wrapper to context.attach (parameter names are part of a callable's type). Match the name and annotate with the imported Context type.
What
Flaky test observed here: https://github.com/temporalio/sdk-python/actions/runs/27432588766/job/81086412642?pr=1378
TracingWorkflowInboundInterceptorattaches an OTel context at the start ofexecute_workflow/handle_queryand detaches it in thefinally. The workflow event loop runs portions of the workflow insidecontextvars.copy_context().run(...)(e.g.wait_condition), so thatfinallycan execute in a copiedcontextvars.Context. A copy preserves the OTel context value — so the existingcontext is get_current()guard passes — but invalidates the attach token, soopentelemetry.context.detachfails internally and logs "Failed to detach context".Because
LogCapturerattaches a handler to the process-globalopentelemetry.contextlogger, that stray ERROR log bled intotest_opentelemetry_safe_detachwhenever a tracing workflow tore down during its capture window (same xdist worker process), making the test flaky. In production it's recurring ERROR-level noise on affected workflow completions.Fix
Route the best-effort detach through a small
_safe_detachhelper that drops only the"Failed to detach context"record via a scopedlogging.Filter.It still calls
opentelemetry.context.detach, so attach/detach call counts stay balanced — the leak invariant enforced bytest_opentelemetry_context_restored_after_activity. (Alternatives that avoid the publicdetach— restoring viaattach(previous), or resetting the runtime contextvar directly — both break that balance test, so suppressing at the logger is the only approach that satisfies balanced calls + no spurious log + no test weakening.)Caveat
During a detach the filter is attached to the process-global OTel context logger, so it would also drop an unrelated "Failed to detach context" emitted by other code in that brief window. Given the alternatives break a correctness invariant, this is the right trade-off.
Testing
tests/contrib/opentelemetry/andtests/contrib/openai_agents/test_openai_tracing.pypass together (27/27), including the attach/detach balance test.detachbut produces no record through_safe_detach.